<html>
<head>
<link rel=stylesheet href="style.css" type="text/css">
<title>collectl - Process Monitoring</title>
</head>

<body>
<center><h1>Process Monitoring</h1></center>
<p>
<h3>Introduction</h3>
Collectl has the ability to monitor processes in pretty much the same way as
ps or top do, as can be seen here:

<div class=terminal-wide14>
<pre>
# PROCESS SUMMARY (faults are /sec)
# PID  User     PR  PPID S   VSZ   RSS  SysT  UsrT Pct  AccuTime MajF MinF Command
21502  root     15  1749 S    6M    2M  0.00  0.00   0   0:06.40    0    0 /usr/sbin/sshd
21504  root     15 21502 S    4M    1M  0.00  0.00   0   0:00.79    0    0 -bash
22984  root     15     1 S    7M    1M  0.00  0.00   0   0:00.78    0    0 cupsd
23073  apache   15  1914 S   18M    8M  0.00  0.00   0   0:00.01    0    0 /usr/sbin/httpd
</pre>
</div>

You can select processes to monitor by pid, parent, owner or
command name.
When using names, you can use partial or full names or even use strings that
were part of the command invocation string such as parameters.  
The main benefit of monitoring processes with
collectl is that you can coordinate the sample times of process data with
any of the other subsystems collectl can monitor.
<p>
The way you tell collectl to monitor processes is to specify the Z subsystem
and any optional parameters with -Z (sorry, but -P was already taken).  Since
monitoring processes is a heavier-weight function, it is recommended to use a
different interval, which can be specified after the main monitoring interval
separated by a colon.  The default process interval is 60 seconds.  Therefore, to monitor all
the processes once every 20 seconds and everything else every 5 seconds,
simply say:

<pre>
collectl -sZ -i5:20
</pre>

The biggest mistake people make when running this command interactively is to
leave off the interval or specify something like -i1 and not see any process
data.  That is because the default process interval is 60 seconds and they just
haven't waited long enough for the output!  This should be obvious since collectl
will announce it is <i>waiting for a 60 second sample</i>.
<p>
There are a few restrictions to the way these intervals are specified.  The
process interval must be a multiple of the main interval AND cannot be less
than it.  If you specify a process interval without a main interval, the main
interval defaults to the process interval.
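<p>
The interval rules above can be sketched in a few lines of Python.  This is only
an illustration of the stated restrictions, not collectl's own parsing code: the
function name <i>parse_intervals</i> is hypothetical, and the sketch assumes the
60 second process default applies whenever no process interval is given.

```python
from decimal import Decimal

DEFAULT_PROCESS_INTERVAL = Decimal("60")  # process default stated in the text


def parse_intervals(spec):
    """Split an -i style value such as '5:20' or ':.1' into (main, process)."""
    main_text, _, process_text = spec.partition(":")
    if not main_text and not process_text:
        raise ValueError("no interval given")
    process = Decimal(process_text) if process_text else DEFAULT_PROCESS_INTERVAL
    # With no main interval, the main interval defaults to the process interval.
    main = Decimal(main_text) if main_text else process
    if process < main:
        raise ValueError("process interval cannot be less than the main interval")
    if process % main != 0:
        raise ValueError("process interval must be a multiple of the main interval")
    return main, process
```

With this sketch, <i>parse_intervals("5:20")</i> yields a main interval of 5 and a
process interval of 20, while a value such as "5:12" is rejected because 12 is not
a multiple of 5.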
<p>
To monitor a subset of processes use the -Z switch followed by one or more
process selectors, separated by commas.  If a plus sign immediately follows a
process selector any processes selected by it will have their threads
monitored as well.  See <i>collectl -x</i> or <i>man collectl</i> for more details.
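<p>
As a rough illustration of the selector syntax, the following Python sketch splits
a -Z value on commas and notes which selectors carry the trailing plus sign.  It
assumes the first character of each selector is a type letter, as in the
<i>-Zfabc</i> and <i>-Zcdt</i> examples elsewhere on this page; the function name
and the record layout are hypothetical, and the meaning of each type letter is
left to <i>man collectl</i>.

```python
def parse_selectors(value):
    """Split a -Z value like 'cdt,fabc+' into a list of selector records."""
    selectors = []
    for item in value.split(","):
        # A trailing plus sign requests thread monitoring for this selector.
        threads = item.endswith("+")
        if threads:
            item = item[:-1]
        if not item:
            continue  # skip empty entries such as a stray comma
        selectors.append({"type": item[0], "match": item[1:], "threads": threads})
    return selectors
```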
<p>
Finally, as with other data collected by collectl, you can play back process
data by specifying -p.  While not exactly plottable data, you can specify
-P and the output will be written to a separate file as time stamped space
delimited data, one process per line.
<p>
<h3>Dynamic Process/Thread Monitoring</h3>
A unique feature of process monitoring is that processes specified with a
selection list via -Z do
not have to exist at the time collectl is run.  In other words, collectl will
continue to look for new processes that match this selection list during every
collection cycle!  While this is indeed a good thing if that's
what you want to do, it does
come with a price in overhead: <i>not a lot, but overhead nevertheless</i>.  If
you do not want this effect and only want to look at those processes that match
the selection list at the time collectl is started, specify -OP to suppress
this behavior.
<p>
This holds for process threads as well.  If you use -OP you will not see
threads that were created after collectl starts.
<p>
Perhaps the best way to see this in effect is to run collectl with the
following command:

<pre>
collectl -i:.1 -sZ -Zfabc -oh
</pre>

noting a few tricks.  First of all, the .1 for an interval is not a mistake.
It is there to show that you
can indeed use collectl to spot the appearance of short
lived processes - <i>just don't do it unless you really need to</i>.  The purpose of
the -oh is to suppress headers which can be really annoying in this mode (try
it without it and see what I mean).  Finally, the -Z switch is saying to look
for any processes invoked with a command that contained the string 'abc' in it.
When this command is invoked there shouldn't be any output unless someone IS
running a command with 'abc' in it.  Now go to a
different window or terminal and edit the file abc with your favorite editor.  
You will immediately see
collectl output and when you exit the editor the output will stop.
<p>
<h3>The Time Fields</h3>
The SysT and UsrT fields represent the system and user time the line item spent during
the current interval.  One might think this means that in a 60 second interval the
most time a process could spend is 60 seconds.  Not quite!  On a
multi-processor/multi-core system the process could actually spend up to 60 seconds
on each core, so just be careful how the times are interpreted.  The Pct field is
the percentage of the current interval the process consumed in system and user
time, which can also exceed 100% in multi-processor situations.  Finally, since the
AccuTime field accumulates these times, it can exceed the actual wall clock time.
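<p>
The relationship between the time fields and Pct can be written as a one-line
calculation.  The Python below is a minimal sketch, assuming Pct is simply system
plus user time divided by the interval and rounded to a whole number; the function
name is hypothetical and collectl's exact rounding is not documented here.

```python
def percent_of_interval(sys_time, usr_time, interval):
    """Return CPU time used as a whole-number percentage of the interval."""
    return round((sys_time + usr_time) / interval * 100)
```

For a 1 second interval with SysT of 0.40 and UsrT of 0.00 this gives 40.  A
process that accumulates 120 seconds of CPU time across several cores in a 60
second interval gives 200, which is how Pct can exceed 100%.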
<p>
When run in non-threaded mode, the times reported include all time consumed by
all threads.  When run in threaded mode, times are reported for individual
threads as well as the main process.  In other words, if a
process's only job is to start threads, it will typically show times of 0.  If
you rerun collectl in non-threaded mode you will see it report aggregated
times.
<p>
<h3>Process Memory Utilization</h3>
The types of memory utilization displayed as part of the process monitoring output
are the <i>Virtual</i> and <i>Resident</i> sizes.  However, there are additional
types of memory that collectl tracks, and to see them you can select an alternate
process display format as follows:

<div class=terminal-wide14>
<pre>
# collectl --procmem -i:1 -c1
# PID  User     S VmSize  VmLck  VmRSS VmData  VmStk  VmExe  VmLib Command
21502  root     S  6896K      0  2168K   512K    36K   268K  3304K /usr/sbin/sshd
21504  root     S  4392K      0  1496K   296K    24K   592K  1356K -bash
22984  root     S  7484K      0  1884K  1692K    88K   224K  3248K cupsd
23073  apache   S 18140K      0  8392K  2152K    56K   292K 12596K /usr/sbin/httpd
</pre>
</div>

<h3>Process I/O Statistics</h3>
As of collectl Version 2.4.0, if process I/O stats have
been built into the kernel, collectl adds 2 additional columns to the
process display named <i>RKB</i> and <i>WKB</i>.
In the following example I've set the display interval to
1 second and removed the initialization message from the output.  As with all
fields reported as rates/sec, these show consistent values independent of
the interval; if you want the <i>unnormalized</i> values, be sure to include
that option with the -o switch as -on.

<div class=terminal-wide14>
<pre>
# collectl -sZ -i:1
# PROCESS SUMMARY (faults are /sec)
# PID  User     PR  PPID S   VSZ   RSS  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
    1  root     20     0 S    4M  552K  0.00  0.00   0   0:00.68    0    0    0    0 init
    2  root     15     0 S     0     0  0.00  0.00   0   0:00.00    0    0    0    0 kthreadd
    3  root     RT     2 S     0     0  0.00  0.00   0   0:00.02    0    0    0    0 migration/0
</pre>
</div>
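<p>
The difference between normalized and unnormalized values can be illustrated with
a short Python sketch.  It assumes a rate is just the change in a counter over the
interval divided by the interval length in seconds; the function and parameter
names are hypothetical and do not come from collectl itself.

```python
def display_value(delta, interval, normalize=True):
    """Return a per-second rate, or the raw per-interval delta if normalize is False."""
    return delta / interval if normalize else delta
```

A process writing 600 KB during a 60 second interval and one writing 50 KB during
a 5 second interval both show a rate of 10 per second, whereas the unnormalized
values would be 600 and 50.
<p>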

A particularly useful feature I've found is monitoring one or more processes by name (you
can also monitor by pid, ppid and uid) to see
what they're doing.  In this case I'm using the dt program to write a large file and telling
collectl to display any process whose command string matches <i>dt</i> as well as to include
time stamps.

<div class=terminal-wide14>
<pre>
# collectl -sZ -i:1 -Zcdt -oT
# PROCESS SUMMARY (faults are /sec)
#          PID  User     PR  PPID S   VSZ   RSS  SysT  UsrT Pct  AccuTime  RKB  WKB MajF MinF Command
09:01:03 13577  root     20 12775 R    1M    1M  0.04  0.00   4   0:01.92    0  16K    0    0 ./dt
09:01:04 13577  root     20 12775 D    1M    1M  0.40  0.00  40   0:02.32    0 118K    0    0 ./dt
09:01:05 13577  root     20 12775 D    1M    1M  0.24  0.00  24   0:02.56    0  65K    0    0 ./dt
</pre>
</div>

Finally, note that there is more process i/o data available but I chose to leave it off
the default display and instead have the following alternate format. This is
the same methodology used for reporting process memory utilization, namely you only
see VSZ and RSS in the default display but much more with --procmem.  Also note in this case
I chose 1/2 second monitoring as well as showing time in msec resolution:

<div class=terminal-wide14>
<pre>
# collectl --procio -i:.5 -Zcdt -oTm
#              PID  User     S  SysT  UsrT   RKB   WKB  RKBC  WKBC  RSYS  WSYS  CNCL  Command
09:03:24.003 13614  root     D  0.12  0.00     0   32K     0   32K     0    64     0  ./dt
09:03:24.503 13614  root     D  0.14  0.00     0   32K     0   32K     0    64     0  ./dt
09:03:25.003 13614  root     R  0.10  0.00     0   24K     0   24K     0    48     0  ./dt
</pre>
</div>

Naturally, as with all other data in collectl, you can record it to a file and play it
back later, again using --procio or --procmem.
<p>
<h3>Understanding Processing Overhead</h3>
This is intended to be a brief description of how process monitoring works with
the hope that it will help you use the capability more efficiently.
<p>
Collectl maintains 2 main lists of monitoring information: <i>pids to monitor</i>
and <i>pids to ignore</i>.  These lists are built at the time collectl starts, so if
-OP is not specified, the effect is to execute a ps command and save all the
pids in the to-be-monitored list.  If filters are specified with -Z, only those
pids that match are placed in to-be-monitored and the rest placed in the
do-not-monitor list.
<p>
If collectl is only monitoring a specific set of processes, either because -OP
was specified or because -Z listed only specific pids (not ppids), on
each monitoring pass collectl only looks at the pids in the to-be-monitored
list.  In other words, this is as efficient as it gets.
<p>
If doing dynamic process monitoring, every monitoring pass collectl has to
read /proc to get a list of ALL current processes.  While it ignores any in
do-not-monitor, it must look at the rest.  If any of these are in the
to-be-monitored list and have had thread monitoring requested, additional work
is required to see if any new threads have shown up.
Any processes not in to-be-monitored are obviously NEW processes and must
then be examined to see if they
match any selection criteria and this involves
reading the /proc/pid/stat file.  That pid is
then placed in one of the two lists.
It should be understood that during any particular interval a lot of processes
come and go, such as cat, ls, etc.  However, these are short lived enough as to
not even be seen by collectl, unless of course collectl is running at a very
fine grained monitoring level.
<p>
Occasionally a process being monitored disappears because it has terminated.
When this happens its pid is removed from the to-be-monitored list.
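<p>
The bookkeeping described above can be sketched as a single function that is
called once per monitoring pass.  This is a simplified Python illustration, not
collectl's actual code: the two lists are modeled as sets, the function name is
hypothetical, and <i>matches</i> is a hypothetical callable standing in for the
check against the selection criteria (which in collectl involves reading the
/proc/pid/stat file).  Thread discovery and the additional maintenance of these
structures are not modeled.

```python
def monitoring_pass(current_pids, monitored, ignored, matches):
    """Update the two pid sets in place for one dynamic monitoring pass.

    current_pids: pids present right now (for example, listed from /proc)
    monitored, ignored: sets standing in for the two lists in the text
    matches: hypothetical callable deciding whether a new pid is selected
    """
    current = set(current_pids)
    # Processes that have terminated drop out of the to-be-monitored list.
    monitored.intersection_update(current)
    # Anything in neither list is a new process and must be classified.
    for pid in current - monitored - ignored:
        if matches(pid):
            monitored.add(pid)
        else:
            ignored.add(pid)
    return sorted(monitored)
```

Only pids in neither set pay the cost of the <i>matches</i> check, which is why
bounding the number of new processes keeps the overhead down.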
<p>
Finally, these data structures (and a couple of others that have not been
described) need maintenance to keep them from growing.  If the number of
processes to monitor has been fixed, this maintenance is significantly reduced.
<p>
So the bottom line is if you have to use dynamic monitoring, try to bound the
number of processes and/or threads.  If you really need to see it all, don't
be afraid to, but be mindful of the overhead.  Collecting all process
data with the default interval has been observed to take about 1 minute of CPU
time, which is less than 0.1%, on a lightly loaded Proliant DL380, but that load
will be higher with more active processes.
<p>
<h3>RESTRICTIONS</h3>
<ul>
<li>You cannot specify -Z during playback mode.  If you need to look at a subset of
the data consider using a filter like <i>grep</i>.</li>

<li>Thread monitoring is limited to 2.6 kernels.</li>
</ul>

</body>
</html>
